Load the college application data from Lab1 and create the variable Elite by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50 %. We will also save the College names as a new variable and remove Accept and Enroll as temporally they occur after applying, and do not make sense as predictors in future data.

data(College)

College = College %>% mutate(college = rownames(College)) %>%
  mutate(Elite = factor(Top10perc > 50)) %>%
  mutate(Elite = recode(Elite, "TRUE" = "Yes", "FALSE" ="No")) %>%
  dplyr::select(c(-Accept,-Enroll))



colnames(College)
##  [1] "Private"     "Apps"        "Top10perc"   "Top25perc"   "F.Undergrad"
##  [6] "P.Undergrad" "Outstate"    "Room.Board"  "Books"       "Personal"   
## [11] "PhD"         "Terminal"    "S.F.Ratio"   "perc.alumni" "Expend"     
## [16] "Grad.Rate"   "college"     "Elite"

We are going to create a training and test set by randomly splitting the data. First set a random seed by

# do not change this; for a break google `8675309`
set.seed(8675309)
n = nrow(College)
n.train = floor(.75*n)
train = sample(1:n, size=n.train, replace=FALSE)
College.train = College[train,]
College.test = College[-train,]
  1. Summarize the training and test data, and comment on which variables are numeric, factors or that could be treated as either. Comment on whether the summaries appear to be similar across the test and training data sets.
#summary for training set
summary(College.train[,-17])
##  Private        Apps           Top10perc       Top25perc     
##  No :174   Min.   :   81.0   Min.   : 1.00   Min.   :  9.00  
##  Yes:408   1st Qu.:  804.2   1st Qu.:15.00   1st Qu.: 40.00  
##            Median : 1590.5   Median :23.00   Median : 54.00  
##            Mean   : 2969.6   Mean   :26.99   Mean   : 55.24  
##            3rd Qu.: 3753.5   3rd Qu.:35.00   3rd Qu.: 68.00  
##            Max.   :20192.0   Max.   :95.00   Max.   :100.00  
##   F.Undergrad     P.Undergrad         Outstate       Room.Board  
##  Min.   :  139   Min.   :    1.0   Min.   : 2700   Min.   :1880  
##  1st Qu.: 1025   1st Qu.:  107.8   1st Qu.: 7036   1st Qu.:3587  
##  Median : 1773   Median :  372.0   Median : 9806   Median :4181  
##  Mean   : 3735   Mean   :  877.9   Mean   :10255   Mean   :4343  
##  3rd Qu.: 4260   3rd Qu.: 1077.2   3rd Qu.:12722   3rd Qu.:5011  
##  Max.   :31643   Max.   :21836.0   Max.   :21700   Max.   :7425  
##      Books           Personal         PhD            Terminal     
##  Min.   : 110.0   Min.   : 250   Min.   :  8.00   Min.   : 30.00  
##  1st Qu.: 482.5   1st Qu.: 875   1st Qu.: 63.00   1st Qu.: 71.00  
##  Median : 525.0   Median :1200   Median : 76.00   Median : 83.00  
##  Mean   : 555.6   Mean   :1333   Mean   : 72.97   Mean   : 80.07  
##  3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00   3rd Qu.: 92.00  
##  Max.   :2340.0   Max.   :4913   Max.   :103.00   Max.   :100.00  
##    S.F.Ratio     perc.alumni        Expend        Grad.Rate      Elite    
##  Min.   : 2.5   Min.   : 0.00   Min.   : 3365   Min.   : 10.00   No :527  
##  1st Qu.:11.7   1st Qu.:13.00   1st Qu.: 6720   1st Qu.: 53.00   Yes: 55  
##  Median :13.8   Median :20.00   Median : 8326   Median : 65.00            
##  Mean   :14.3   Mean   :22.22   Mean   : 9455   Mean   : 65.48            
##  3rd Qu.:16.7   3rd Qu.:31.00   3rd Qu.:10648   3rd Qu.: 78.00            
##  Max.   :39.8   Max.   :64.00   Max.   :56233   Max.   :100.00
#summary for test set
summary(College.test[,-17])
##  Private        Apps         Top10perc       Top25perc     
##  No : 38   Min.   :  152   Min.   : 3.00   Min.   : 19.00  
##  Yes:157   1st Qu.:  672   1st Qu.:17.00   1st Qu.: 44.50  
##            Median : 1496   Median :25.00   Median : 55.00  
##            Mean   : 3097   Mean   :29.25   Mean   : 57.47  
##            3rd Qu.: 3498   3rd Qu.:36.00   3rd Qu.: 69.00  
##            Max.   :48094   Max.   :96.00   Max.   :100.00  
##   F.Undergrad       P.Undergrad        Outstate       Room.Board  
##  Min.   :  282.0   Min.   :   7.0   Min.   : 2340   Min.   :1780  
##  1st Qu.:  927.5   1st Qu.:  71.5   1st Qu.: 7825   1st Qu.:3606  
##  Median : 1506.0   Median : 334.0   Median :10658   Median :4220  
##  Mean   : 3595.6   Mean   : 787.9   Mean   :10996   Mean   :4401  
##  3rd Qu.: 3496.0   3rd Qu.: 877.5   3rd Qu.:13298   3rd Qu.:5114  
##  Max.   :27378.0   Max.   :9054.0   Max.   :20100   Max.   :8124  
##      Books           Personal         PhD           Terminal     
##  Min.   :  96.0   Min.   : 300   Min.   :22.00   Min.   : 24.00  
##  1st Qu.: 450.0   1st Qu.: 850   1st Qu.:60.00   1st Qu.: 69.00  
##  Median : 500.0   Median :1235   Median :74.00   Median : 80.00  
##  Mean   : 530.7   Mean   :1364   Mean   :71.74   Mean   : 78.62  
##  3rd Qu.: 600.0   3rd Qu.:1653   3rd Qu.:86.50   3rd Qu.: 92.50  
##  Max.   :1000.0   Max.   :6800   Max.   :99.00   Max.   :100.00  
##    S.F.Ratio      perc.alumni        Expend        Grad.Rate     
##  Min.   : 4.60   Min.   : 0.00   Min.   : 3186   Min.   : 21.00  
##  1st Qu.:11.20   1st Qu.:14.00   1st Qu.: 6898   1st Qu.: 53.00  
##  Median :13.00   Median :24.00   Median : 8575   Median : 66.00  
##  Mean   :13.45   Mean   :24.31   Mean   :10272   Mean   : 65.41  
##  3rd Qu.:15.45   3rd Qu.:32.00   3rd Qu.:11045   3rd Qu.: 76.00  
##  Max.   :27.60   Max.   :60.00   Max.   :42926   Max.   :118.00  
##  Elite    
##  No :172  
##  Yes: 23  
##           
##           
##           
## 

ANSWER: Catergorical variables: Private and Elite. College name should not be treated as a variable. All other variables left should be treated as numerical variables.

From the summaries given above, variables in both train set and test set have similar ranges and mean. One thing we need pay attention to is that the ratio of public school vs private school in training set is 0.43 while the ratio in test set is only 0.24. This may introduce bias since private school and public school have many differences.

  1. Create scatter plots of predictors versus Apps using the training data only. If you use pairs or preferably ggpairs make sure that Apps is on the y-axis in plots versus the other predictors. (Make sure that the plots are legible, which may require multiple plots.)
    Comment on any features in the plots, such as potential outliers, non-linearity, needs for transformations etc.
#melt data
College.train_melt = College.train %>% select(-c(Private,college, Elite)) %>% 
  gather(feature, value, c(Top10perc:Grad.Rate))

College.train_melt_f = College.train %>% select(-college) %>% 
  select(c(Private,Apps,Elite)) %>% 
  gather(feature, value, c(Private,Elite))

College.train_melt_1 = College.train_melt[1:2328,]
College.train_melt_2 = College.train_melt[2329:4656,]
College.train_melt_3 = College.train_melt[4657:6984,]
College.train_melt_4 = College.train_melt[6985:8148,]

ggplot(College.train_melt_f, aes(value, Apps)) +
  geom_point(color = "firebrick") +
  facet_wrap(~feature, scales = "free")

ggplot(College.train_melt_1, aes(value, Apps)) +
  geom_point(color = "firebrick") +
  facet_wrap(~feature, scales = "free")

ggplot(College.train_melt_2, aes(value, Apps)) +
  geom_point(color = "firebrick") +
  facet_wrap(~feature, scales = "free")

ggplot(College.train_melt_3, aes(value, Apps)) +
  geom_point(color = "firebrick") +
  facet_wrap(~feature, scales = "free")

ggplot(College.train_melt_4, aes(value, Apps)) +
  geom_point(color = "firebrick") +
  facet_wrap(~feature, scales = "free")

ANSWER:We first looked at two categorical variables: Elite and Private. Under Yes or No, most colleges have applications less than 15000. So, we can consider applications greater than 20000 are outliers.

We then paid attention to numerical variables. We could see clearly that the number of full time undergraduate has strong linear relation with applications. Variables ‘Top10perc’ and ‘Top25perc’ have weak linear relationship with applications. For other numerical variables, there is no linear relation with applications. Outliers exist in all variables which indicate appropriate tranformation is needed.

  1. Build a linear regression model to predict Apps from the other predictors (without any transformations). Present model summaries and diagnostic plots. Based on diagnostic plots using residuals, comment on the adequacy of your model.
app_lm = lm(Apps ~.-college, data  = College.train)

summary(app_lm)
## 
## Call:
## lm(formula = Apps ~ . - college, data = College.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5303.7  -698.7   -66.3   474.7  8310.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.863e+03  6.513e+02  -2.861 0.004386 ** 
## PrivateYes  -5.926e+02  2.225e+02  -2.663 0.007961 ** 
## Top10perc   -4.819e+00  1.103e+01  -0.437 0.662281    
## Top25perc    1.024e+01  7.576e+00   1.351 0.177237    
## F.Undergrad  6.239e-01  2.063e-02  30.245  < 2e-16 ***
## P.Undergrad -9.703e-02  4.991e-02  -1.944 0.052378 .  
## Outstate     5.200e-02  3.040e-02   1.711 0.087666 .  
## Room.Board   3.514e-01  7.928e-02   4.433 1.12e-05 ***
## Books        2.117e-01  3.673e-01   0.576 0.564636    
## Personal    -2.575e-01  1.081e-01  -2.381 0.017591 *  
## PhD          5.883e+00  7.472e+00   0.787 0.431378    
## Terminal    -1.702e+01  8.432e+00  -2.019 0.044005 *  
## S.F.Ratio    1.563e+01  2.077e+01   0.753 0.452010    
## perc.alumni -2.286e+01  6.773e+00  -3.376 0.000787 ***
## Expend       8.686e-02  2.136e-02   4.067 5.43e-05 ***
## Grad.Rate    1.578e+01  4.762e+00   3.314 0.000980 ***
## EliteYes     1.033e+03  3.435e+02   3.008 0.002749 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1463 on 565 degrees of freedom
## Multiple R-squared:  0.8261, Adjusted R-squared:  0.8211 
## F-statistic: 167.7 on 16 and 565 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(app_lm, labels.id = College.train$college)

ANSWER: From the residual plot vs fitted value, we see that the residual variance increases. The model predicts negative Apps, which is not meaningful. From the QQplot, there is clear heavy tail. From scale-location plot, we can see an increasing, linear trend of the standardized residuals vs fitted values. Therefore, we can conclude that a linear model without transformations and/or interactions is not adequate.

  1. Generate 1000 replicates data sets using the coefficients from the model you fit above. Using RMSE as a statistic, \(\sqrt{\sum_i(y^{\text{rep}} - \hat{y}_i^{\text{rep}})^2/n }\), how does the RMSE from the model based on the training data compare to RMSE’s based on the replicated data. What does this suggest about model adequacy? Provide a histogram of the RMSE’s with a line showing the location of the observed RMSE and compute a p-value. Hint: write a function to calculate RMSE.
#rmse function
rmse = function(y, ypred){
  rmse=sqrt(mean((y-ypred)^2))
  return(rmse)
}

#number of replicates
nsim=1000
#Create design matrix
X = model.matrix(app_lm)
#Get 1000 samples of parameters
params = sim(app_lm, nsim)
#Create matrix to store 1000 sets of y's
y_rep = matrix(0, nrow = nsim ,ncol = nrow(College.train))
#Create vector to store 1000 RMSE statistic
RMSE.2<-rep(0,nsim)
#compute RMSE based on simulations
for(i in 1:nsim){
  y_rep[i,] = X %*% params@coef[i,] + rnorm(1, 0, params@sigma[i])
  RMSE.2[i]<-rmse(predict(app_lm), y_rep[i,])
  
}
RMSE.2 = data.frame(RMSE.2)

#compute RMSE based on train data
RMSE.1<-rmse(College.train$Apps,predict(app_lm))

#plot the histogram
GGPLOT.4<-ggplot(data= RMSE.2, aes(x=RMSE.2))+
  geom_histogram() + 
  geom_vline(xintercept = RMSE.1, color="red")+labs(x="RMSE", y="Count")
plot(GGPLOT.4)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#compute p-value
pval.lm<-sum(RMSE.2>RMSE.1)/nsim
RMSE.1
## [1] 1441.111
print(pval.lm)
## [1] 0.315

ANSWER: The RMSE of the observed data is not inconsistent with the simulated model. In fact, the tail probability, or p-value, of ‘r pval’ for RMSE is too large to not reject this model (based on this particular statistic).

  1. Build a second model, considering transformations of the response and predictors, possible interactions, etc with the goal of trying to achieve a model where assumptions for linear regression are satisfied, providing justification for your choices. Comment on how well the assumptions are met and and issues that diagnostic plots may reveal.

ANSWER: First we perform a log transformation on ‘Apps’. From the residual plot, we can see that the latter is much better than for the non-transformed model. We also applied the boxcox procedure to find the best y tranformation. And we can see 0 is close to the best parameter. Therefore, for a better intepratation of y, we admit log transformation on y. (More below)

#log transformation on y
app_lm_yt = lm(log(Apps)~.-college, data = College.train)
par(mfrow = c(2,2))
#diagnostic plots
plot(app_lm_yt)

#boxcox procedure to find best y transformation
boxcox(app_lm)

However, there is still a linear trend between residuals and fitted value based on the residual plots. We next move to x transformations. Let us go back to the scatter plots in question 2. As we have already seen, data is not evenly distributed. We applied log transformation on x which have many leverage points. For ‘Top25perc’ and ‘Grad.Rate’, we applied quadratic transformation on x since there is quadartic trend. (More below)

#log transformation on y
app_lm_t = lm(log(Apps) ~ Private + log(Top10perc) + (Top25perc)^2 +       
                log(F.Undergrad) +log(P.Undergrad) +  Outstate + 
                (Room.Board) + log(Books) + log(Personal) +         
                log(PhD)+Terminal+log(S.F.Ratio)+ perc.alumni +    
                log(Expend) + (Grad.Rate)^2+ Elite, 
              data=College.train)
summary(app_lm_t)
## 
## Call:
## lm(formula = log(Apps) ~ Private + log(Top10perc) + (Top25perc)^2 + 
##     log(F.Undergrad) + log(P.Undergrad) + Outstate + (Room.Board) + 
##     log(Books) + log(Personal) + log(PhD) + Terminal + log(S.F.Ratio) + 
##     perc.alumni + log(Expend) + (Grad.Rate)^2 + Elite, data = College.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.46249 -0.22772  0.02455  0.21071  2.10428 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2.568e+00  9.422e-01  -2.726 0.006612 ** 
## PrivateYes        5.431e-03  6.293e-02   0.086 0.931257    
## log(Top10perc)   -1.976e-01  5.347e-02  -3.696 0.000241 ***
## Top25perc         5.679e-03  1.952e-03   2.909 0.003767 ** 
## log(F.Undergrad)  9.719e-01  2.951e-02  32.935  < 2e-16 ***
## log(P.Undergrad) -5.516e-02  1.474e-02  -3.743 0.000200 ***
## Outstate          2.154e-05  8.322e-06   2.589 0.009884 ** 
## Room.Board        8.466e-05  2.169e-05   3.904 0.000106 ***
## log(Books)        1.444e-02  6.347e-02   0.228 0.820104    
## log(Personal)    -2.183e-02  3.784e-02  -0.577 0.564262    
## log(PhD)          2.419e-01  8.760e-02   2.761 0.005941 ** 
## Terminal         -1.821e-03  1.933e-03  -0.942 0.346545    
## log(S.F.Ratio)    5.713e-02  8.130e-02   0.703 0.482511    
## perc.alumni      -3.655e-04  1.797e-03  -0.203 0.838929    
## log(Expend)       1.546e-01  8.481e-02   1.823 0.068760 .  
## Grad.Rate         3.145e-03  1.245e-03   2.525 0.011838 *  
## EliteYes          1.608e-01  7.005e-02   2.296 0.022049 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3795 on 565 degrees of freedom
## Multiple R-squared:  0.8756, Adjusted R-squared:  0.8721 
## F-statistic: 248.6 on 16 and 565 DF,  p-value: < 2.2e-16
#diagnostic plots
par(mfrow = c(2,2))
plot(app_lm_t)

Finally, we applied step function(BIC) to see if there are potential interation terms. From the plots, we can see, both residual plots and QQplot are much better than before. There, we finally choose the model given below are the best linear model for this data.

# Systematic check for interactions
app_lm_t2<-lm(log(Apps) ~ Private + log(Top10perc) + (Top25perc)^2 +       
                log(F.Undergrad) + log(P.Undergrad) +  Outstate + 
                (Room.Board) + log(Books) + log(Personal) +         
                log(PhD)+Terminal+log(S.F.Ratio)+ perc.alumni +    
                log(Expend) + Grad.Rate+ (Elite)^2, 
              data=College.train)

#our best linear model using BIC
app_best = step(app_lm_t2, k=log(nrow(College.train)),trace=0)

summary(app_best)
## 
## Call:
## lm(formula = log(Apps) ~ log(Top10perc) + Top25perc + log(F.Undergrad) + 
##     log(P.Undergrad) + Outstate + Room.Board + log(PhD) + Elite, 
##     data = College.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.52512 -0.23310  0.02069  0.22267  2.09038 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.086e+00  2.559e-01  -4.244 2.56e-05 ***
## log(Top10perc)   -1.732e-01  5.089e-02  -3.404  0.00071 ***
## Top25perc         5.779e-03  1.881e-03   3.073  0.00222 ** 
## log(F.Undergrad)  9.778e-01  2.288e-02  42.736  < 2e-16 ***
## log(P.Undergrad) -6.152e-02  1.422e-02  -4.327 1.79e-05 ***
## Outstate          3.134e-05  6.572e-06   4.769 2.35e-06 ***
## Room.Board        9.707e-05  2.051e-05   4.734 2.78e-06 ***
## log(PhD)          1.986e-01  6.711e-02   2.959  0.00321 ** 
## EliteYes          1.828e-01  6.830e-02   2.677  0.00765 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3802 on 573 degrees of freedom
## Multiple R-squared:  0.8734, Adjusted R-squared:  0.8716 
## F-statistic: 493.9 on 8 and 573 DF,  p-value: < 2.2e-16
#diagnostic plots
par(mfrow=c(2,2))
plot(app_best,labels.id = College.train$college)

# Termplots for further diagnostics
par(mfrow=c(2,2))
termplot(app_best, data=College.train,partial.resid=T, smooth=panel.smooth, se=T)

  1. Repeat problem 4, but using your model from problem 5. If you transform the response, you will need to back transform data to the original units in order to compute the RMSE in the original units. Does this suggest that the model is adequate? Do the two graphs provide information about which model is better?
nsim=1000
rmse = function(y, ypred){
  rmse=sqrt(mean((y-ypred)^2))
  return(rmse)
}
X<-model.matrix(app_best, 
                data = College.train)
params<-sim(app_best, nsim)
Y<-matrix(0, nrow=nrow(College.train),ncol=nsim )
RMSE.2<-rep(0,nsim)
# note that we had log-transformed the response; hence we need to exponentiate the simulated responses
for(j in 1:nsim){
  Y[,j]<-exp(X %*% params@coef[j,]+ rnorm(1, 0, params@sigma[j]))
  RMSE.2[j]<-rmse(Y[,j],exp(predict(app_best)))
  
}
RMSE.2<-as.data.frame(RMSE.2)


RMSE.1.bestlm<-rmse(College.train$Apps,exp(predict(app_best)))
GGPLOT.6<-ggplot(data= RMSE.2, aes(x=RMSE.2))+geom_histogram() + geom_vline(xintercept = RMSE.1.bestlm, col="red")+labs(x="RMSE", y="Count")
print(GGPLOT.6)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#pvalue
pval.best<-sum(RMSE.2>RMSE.1)/nsim
print(pval.best)
## [1] 0.386

ANSWER: We find that the observed value is closer to the mode of the simulated RMSE distribution. As indicated by the current p-value of ‘r pval’, the current model is a better fit to the observed data in the training set. Of course, this is based on a single summary statistic only (RMSE), and cannot be used for final approval of the model.

  1. Use your two fitted models to predict the number of applications for the testing data, College.test. Plot the predicted residuals \(y_i - \hat{y}_i\) versus the predictions. Are there any cases where the model does a poor job of predicting? Compute the RMSE using the test data where now RMSE = \(\sqrt{\sum_{i = 1}^{n.test}(y_i - \hat{y}_i)^2/n.test}\) where the sum is over the test data. Which model is better for the out of sample prediction?

ANSWER: First we plot the predictions for the base model with main effects only, and the ‘best’ model obtained by BIC. We found that a few predictions have an excessively high residuals, so we removed those from the plot.

# rerun the base model
apps.lm0<-lm(Apps~., data=College.train[,-17])
a.new<-predict(apps.lm0, newdata = College.test[,-17])
b.new<-predict(app_best, newdata=College.test)

# remove those with residual > 8000
ind.a<-which(abs(a.new-College.test$Apps)>8000)
ind.b<-which(abs(exp(b.new)-College.test$Apps)>8000)

par(mfrow=c(1,2))
plot(a.new[-ind.a],a.new[-ind.a]-College.test$Apps[-ind.a],xlab="Predicted", ylab="Residual",main ="Base model (#4)")
plot(exp(b.new[-ind.b]), exp(b.new[-ind.b])-College.test$Apps[-ind.b],xlab="Predicted", ylab="Residual",main="Best model (#6)")

# colleges with extremely high number of applications
print(College$college[ind.a])
## [1] "Antioch University"         "College of Saint Catherine"
print(College$college[ind.b])
## [1] "College of Saint Catherine"

The worst fits (residual >8000) are for “Antioch University” and “College of Saint Catherine” for the base model, and “College of Saint Catherine” and “Dordt College” for the ‘best’ linear model. Next, we compute the RMSE of the test data.

RMSE.0.test<-sqrt(mean((a.new-College.test$Apps)^2))
RMSE.best.test<-sqrt(mean((exp(b.new)-(College.test$Apps))^2))

print(RMSE.0.test)
## [1] 2894.803
print(RMSE.best.test)
## [1] 2781.898

With an RMSE of ‘r RMSE.best.test’ the out of sample prediction is better for the model with interactions and transformations compared to the base model which has an RMSE of ‘r RMSE.0.test’.

  1. Add the test RMSE’s from the two models to the respective histograms from 4 and 6. Are these values surprising relative to the RMSEs from the replicated data? Explain. What do you think this implies for model adequacy checks? How accurately can we predict college applications?
par(mfrow=c(1,2))
plot(GGPLOT.4+geom_vline(xintercept = RMSE.0.test, col="yellow"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

plot(GGPLOT.6+ geom_vline(xintercept = RMSE.best.test, col="yellow"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

pval.1<-sum(RMSE.0.test>RMSE.2)/nsim
print(pval.1)
## [1] 0.904
pval.2<-sum(RMSE.best.test>RMSE.2)/nsim
print(pval.2)
## [1] 0.899

ANSWER: The p-values of the test data are ‘r pval.1’ and ‘r pval.2’, respectively. This means if the model were true, the observed test RMSE’s are quite unlikely. This implies that we cannot accurately predict the left out data with this model.

  1. As the number of applications is a count variable, a Poisson regression model is a natural alternative for modelling this data. Build a Poisson model using only main effects as in problem 4. Comment on the model adequacy based on diagnostic plots and other summaries. Is there evidence that there is lack of fit?
Apps.poi = glm(Apps ~ .  -college, family = poisson(link = "log"), data = College.train)
summary(Apps.poi)
## 
## Call:
## glm(formula = Apps ~ . - college, family = poisson(link = "log"), 
##     data = College.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -84.601  -18.714   -6.817    9.334  105.987  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  5.366e+00  9.451e-03  567.77   <2e-16 ***
## PrivateYes  -7.638e-01  2.908e-03 -262.61   <2e-16 ***
## Top10perc   -5.634e-03  1.095e-04  -51.46   <2e-16 ***
## Top25perc    4.267e-03  8.845e-05   48.24   <2e-16 ***
## F.Undergrad  7.523e-05  1.667e-07  451.16   <2e-16 ***
## P.Undergrad  2.860e-05  3.947e-07   72.48   <2e-16 ***
## Outstate     4.686e-05  3.808e-07  123.05   <2e-16 ***
## Room.Board   1.173e-04  1.014e-06  115.72   <2e-16 ***
## Books        3.013e-04  4.777e-06   63.07   <2e-16 ***
## Personal    -6.213e-05  1.475e-06  -42.12   <2e-16 ***
## PhD          8.270e-03  1.225e-04   67.53   <2e-16 ***
## Terminal    -2.724e-03  1.391e-04  -19.59   <2e-16 ***
## S.F.Ratio    1.909e-02  2.528e-04   75.51   <2e-16 ***
## perc.alumni -1.120e-02  9.122e-05 -122.82   <2e-16 ***
## Expend       2.491e-05  1.875e-07  132.83   <2e-16 ***
## Grad.Rate    1.060e-02  7.035e-05  150.62   <2e-16 ***
## EliteYes     1.947e-01  3.809e-03   51.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1753917  on 581  degrees of freedom
## Residual deviance:  386292  on 565  degrees of freedom
## AIC: 391729
## 
## Number of Fisher Scoring iterations: 5
par(mfrow=c(2,2))
plot(Apps.poi)

#checking for overdispersion
Apps.poi$deviance/Apps.poi$df.residual
## [1] 683.7027

ANSWER: Due to the inflexible variance structure (variance=mean) of the Poisson, we find that all predictors are highly significant. The residual deviance is very large, and the ratio of residual deviance divided by the residual degrees of freedom is 683 which is \(\qq 1\), indicating substantial overdispersion (no need for a formal test here!). In the diagnostic plots, we observe that some of the standard deviance residuals are huge - indicating cause for concern. In summary, the overdispersion indicates a lack of fit.

  1. Generate 1000 replicates data sets using the coefficients from the Poisson model you fit above. Using RMSE as a statistic, \(\sqrt{\sum_i(y^{\text{rep}} - \hat{y}_i^{\text{rep}})^2/n }\), how does the RMSE from the model based on the training data compare to RMSE’s based on the replicated data. What does this suggest about model adequacy? Provide a histogram of the RMSE’s with a line showing the location of the observed RMSE and compute a p-value.

ANSWER: The RMSE of the model is completely out of whack compared to the replicated data (p value is estiamted to be 0 based on 1000 simulations). In conculsion, based on the RMSE as a test statistic, we have very strong evidence to reject the Poisson model.

S = 1000
beta = sim(Apps.poi, n.sims = 1000)
X = model.matrix(Apps.poi)
Y = matrix(NA, nrow(College.train), ncol = S)

rmse.train.poi = rmse(predict(Apps.poi), College.train$Apps)
rmse.rep.poi = numeric(S)

for (s in 1:S) {
  lambda = exp(X %*% beta@coef[s,]) 
  Y[, s] = rpois(nrow(College.train),lambda)
  rmse.rep.poi[s] = rmse(predict(Apps.poi), Y[, s])
}


rmse.rep.poi = as.data.frame(rmse.rep.poi)

hist.poi <-ggplot(data= rmse.rep.poi, aes(x=rmse.rep.poi)) + geom_histogram(fill = "slateblue") + geom_vline(xintercept = rmse.train.poi, col="red")
print(hist.poi)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

rmse.train.poi
## [1] 4550.667
pval.poi <-sum(rmse.rep.poi > rmse.train.poi)/S
print(pval.poi)
## [1] 0
  1. Using the test data set, calculate the RMSE for the test data using the predictions from the Poisson model. How does this compare to the RMSE based on the observed data? Add this RMSE to your plot above.

ANSWER: Below, we show the simulated RMSE, together with the training RMSE (red) and the test RMSE (blue) – we see that the latter is much larger than the training RMSE (not surprising).

Apps.poi = glm(Apps ~ . , family = poisson(link = "log"), data = College.train[,-17])

rmse.test = rmse(College.test$Apps, predict(Apps.poi, newdata=College.test[,-17]))
hist.poi <- hist.poi + geom_vline(xintercept = rmse.test, colour = "cornflowerblue")
rmse.test
## [1] 5787.812
plot(hist.poi)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  1. As in problem 6, consider transformations of the predictors and interactions to improve your model justifying your choices. Provide a summary of the model and comment on diagnostics and summaries for model fit. Is there evidence of overdispersion? Explain.

ANSWER: Given the overdispersion, a first step is to add a variable dispersion parameter, i.e., use a quasi-Poisson model. Of course, this will not change the deviance as the likelihood remains the same, but it will increase the standard errors and give a better idea of which predictors are important. (See explanation of plots below)

# Quasi-Poisson 
Apps.qpoi.1 <- glm(Apps ~ . , family = quasipoisson(link = "log"), data = College.train[,-17])
summary(Apps.qpoi.1)
## 
## Call:
## glm(formula = Apps ~ ., family = quasipoisson(link = "log"), 
##     data = College.train[, -17])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -84.601  -18.714   -6.817    9.334  105.987  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.366e+00  2.572e-01  20.867  < 2e-16 ***
## PrivateYes  -7.638e-01  7.914e-02  -9.651  < 2e-16 ***
## Top10perc   -5.634e-03  2.979e-03  -1.891  0.05911 .  
## Top25perc    4.267e-03  2.407e-03   1.773  0.07678 .  
## F.Undergrad  7.522e-05  4.537e-06  16.581  < 2e-16 ***
## P.Undergrad  2.860e-05  1.074e-05   2.664  0.00795 ** 
## Outstate     4.686e-05  1.036e-05   4.522 7.46e-06 ***
## Room.Board   1.173e-04  2.759e-05   4.253 2.47e-05 ***
## Books        3.013e-04  1.300e-04   2.318  0.02081 *  
## Personal    -6.213e-05  4.013e-05  -1.548  0.12214    
## PhD          8.270e-03  3.332e-03   2.482  0.01336 *  
## Terminal    -2.724e-03  3.784e-03  -0.720  0.47184    
## S.F.Ratio    1.909e-02  6.878e-03   2.775  0.00570 ** 
## perc.alumni -1.120e-02  2.482e-03  -4.514 7.75e-06 ***
## Expend       2.491e-05  5.102e-06   4.882 1.37e-06 ***
## Grad.Rate    1.060e-02  1.914e-03   5.536 4.75e-08 ***
## EliteYes     1.947e-01  1.036e-01   1.878  0.06084 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 740.3351)
## 
##     Null deviance: 1753917  on 581  degrees of freedom
## Residual deviance:  386292  on 565  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5
par(mfrow=c(2,2))
plot(Apps.qpoi.1)

Looking at the diagnostic plots, we find that using the overdispersed Poisson helped substantially. It brought the standard deviance residual from a max of an absolute value of 100 to a max of a little less than 4, and a max Std Pearson residual from 150 to less than 5. There is still a trend in the Scale-Location plot.

Next, we used the predictor transformations and interactions from the best linear model to see if this would resolve some of these issues (note we cannot use the step procedure for the QP because there is no explicit likelihood for this model). As shown below, we found that the pattern in the residual plot is attenuated, and the trend in the std deviance residual plot is lower than in the previous model. The overdispersion has been lowered by accounting for the additional terms, but there is still a lack of fit as indicated by the ratio of the residual deviance by the residual degrees of freedom which is ~284.

# Quasi-Poisson with transformations and interactions
Apps.qpoi.2 <- glm(Apps ~ Private + log(Top10perc) + log(Top25perc) + 
                     log(F.Undergrad) + log(P.Undergrad) + Outstate + Room.Board + 
                     log(PhD) + log(S.F.Ratio) + log(Expend) + log(Grad.Rate) + 
                     Private:Room.Board + log(Top10perc):log(Top25perc) + log(P.Undergrad):Outstate + 
                     Outstate:log(Expend) + Room.Board:log(Expend) + log(S.F.Ratio):log(Grad.Rate) + 
                     log(Expend):log(Grad.Rate), family = quasipoisson(link = "log"), data = College.train)

summary(Apps.qpoi.2)
## 
## Call:
## glm(formula = Apps ~ Private + log(Top10perc) + log(Top25perc) + 
##     log(F.Undergrad) + log(P.Undergrad) + Outstate + Room.Board + 
##     log(PhD) + log(S.F.Ratio) + log(Expend) + log(Grad.Rate) + 
##     Private:Room.Board + log(Top10perc):log(Top25perc) + log(P.Undergrad):Outstate + 
##     Outstate:log(Expend) + Room.Board:log(Expend) + log(S.F.Ratio):log(Grad.Rate) + 
##     log(Expend):log(Grad.Rate), family = quasipoisson(link = "log"), 
##     data = College.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -69.101  -11.045   -2.548    6.880   98.017  
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    2.034e+01  8.595e+00   2.366 0.018312 *  
## PrivateYes                     5.463e-02  1.637e-01   0.334 0.738760    
## log(Top10perc)                -6.256e-01  1.816e-01  -3.445 0.000614 ***
## log(Top25perc)                -2.330e-01  1.477e-01  -1.577 0.115423    
## log(F.Undergrad)               9.019e-01  2.790e-02  32.324  < 2e-16 ***
## log(P.Undergrad)              -1.564e-01  3.286e-02  -4.759 2.48e-06 ***
## Outstate                      -3.937e-04  1.302e-04  -3.022 0.002622 ** 
## Room.Board                     1.617e-03  4.582e-04   3.530 0.000450 ***
## log(PhD)                       1.285e-01  9.028e-02   1.424 0.155065    
## log(S.F.Ratio)                -1.487e+00  1.067e+00  -1.394 0.164014    
## log(Expend)                   -1.796e+00  7.301e-01  -2.460 0.014174 *  
## log(Grad.Rate)                -5.536e+00  2.116e+00  -2.617 0.009111 ** 
## PrivateYes:Room.Board         -4.621e-05  3.383e-05  -1.366 0.172557    
## log(Top10perc):log(Top25perc)  1.347e-01  4.390e-02   3.069 0.002248 ** 
## log(P.Undergrad):Outstate      1.005e-05  2.270e-06   4.427 1.15e-05 ***
## Outstate:log(Expend)           3.899e-05  1.371e-05   2.843 0.004628 ** 
## Room.Board:log(Expend)        -1.629e-04  5.068e-05  -3.214 0.001382 ** 
## log(S.F.Ratio):log(Grad.Rate)  3.697e-01  2.538e-01   1.457 0.145699    
## log(Expend):log(Grad.Rate)     5.388e-01  1.813e-01   2.971 0.003090 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 320.3511)
## 
##     Null deviance: 1753917  on 581  degrees of freedom
## Residual deviance:  160007  on 563  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4
par(mfrow=c(2,2))
plot(Apps.qpoi.2)

Apps.qpoi.2$deviance/Apps.qpoi.2$df.residual
## [1] 284.2047
pchisq(Apps.qpoi.2$deviance, Apps.qpoi.2$df.residual, lower.tail = F)
## [1] 0
  1. Carry out the predictive check using simulated data from the Poisson model. Add the RMSE for the observed data to the plot and the RMSE for prediction on the test data. Does this suggest that the model is OK?
# Choose the model
model<-Apps.qpoi.2 # .1 for untransformed predictors, .2 for transformed predictors

# Simulate 1000 replicates
S = 1000
n<-nrow(College.train)
qp.sim = sim(model, n.sims = 1000)
# Generate the design matrix 
X.q = model.matrix(model, data = College.train)
# Allocate the output 
Y.q = matrix(NA, nrow(College.train), ncol = S)
rmse.rep.qpoi = rep(0, S)

# Compute the rmse of the training data
rmse.qpoi.train = rmse(predict(model, newdata=College.train), College.train$Apps)

# Compute the rmse of the test data
rmse.qpoi.test = rmse(predict(model, newdata=College.test), College.test$Apps)

# The dispersion prameter
d<-(qp.sim@sigma[1])^2

for (s in 1:S) {
  # generate the mu's
  mu<-exp(X.q %*% qp.sim@coef[s,])
  # generate the response realization
  Y.q[, s] = rnegbin(n, mu=mu, theta=mu/(d-1))
  # calculate the RMSE  
  rmse.rep.qpoi[s] = rmse(predict(model), Y.q[, s] )
}


RMSE.rep.qp= data.frame(rmse.rep.qpoi)

#plot the histogram

GGPLOT.13<-ggplot(data= RMSE.rep.qp, aes(x=RMSE.rep.qp))+
  geom_histogram(fill = "slateblue") + 
  geom_vline(xintercept = rmse.qpoi.train, color="red") 


GGPLOT.13.2<-GGPLOT.13+geom_vline(xintercept = rmse.qpoi.test, col="cornflowerblue") + xlab("RMSE")+ ylab("Count")

plot(GGPLOT.13.2)
## Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#compute p-value
pval.train<-sum(RMSE.rep.qp>rmse.qpoi.train)/nsim
print(pval.train)
## [1] 0.668
pval.test<-sum(RMSE.rep.qp>rmse.qpoi.test)/nsim
print(pval.test)
## [1] 0

ANSWER: Accounting for dispersion leads to a good fit to the training data based on the RMSE statistic. The corresponding p-value is ‘r pval.train’ and indicates that there is no evidence to reject the model. For the test data however, we are doing very poorly as can be seen visually and the p-value of 0.

  1. Build a model using the negative binomial model (consider transformations and interactions if needed) and examine diagnostic plots. Are there any suggestions of problems with this model?

ANSWER: First we consider the untransformed model as follows.

Apps.nb <- glm.nb(Apps ~ . -college, data = College.train, link = log)
summary(Apps.nb)
## 
## Call:
## glm.nb(formula = Apps ~ . - college, data = College.train, link = log, 
##     init.theta = 3.652914483)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3065  -0.7978  -0.2008   0.5036   4.1839  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.731e+00  2.333e-01  20.274  < 2e-16 ***
## PrivateYes  -4.762e-01  7.970e-02  -5.975 2.31e-09 ***
## Top10perc   -5.152e-03  3.948e-03  -1.305 0.191948    
## Top25perc    6.483e-03  2.713e-03   2.389 0.016874 *  
## F.Undergrad  1.272e-04  7.382e-06  17.237  < 2e-16 ***
## P.Undergrad  4.474e-06  1.786e-05   0.250 0.802220    
## Outstate     3.537e-05  1.089e-05   3.250 0.001156 ** 
## Room.Board   1.159e-04  2.839e-05   4.084 4.43e-05 ***
## Books        3.660e-04  1.315e-04   2.783 0.005391 ** 
## Personal    -3.174e-05  3.874e-05  -0.819 0.412604    
## PhD          1.003e-02  2.678e-03   3.745 0.000181 ***
## Terminal    -4.722e-03  3.022e-03  -1.563 0.118089    
## S.F.Ratio    3.796e-02  7.438e-03   5.103 3.34e-07 ***
## perc.alumni -7.534e-03  2.426e-03  -3.105 0.001902 ** 
## Expend       2.511e-05  7.644e-06   3.285 0.001019 ** 
## Grad.Rate    7.310e-03  1.706e-03   4.285 1.83e-05 ***
## EliteYes     2.203e-01  1.230e-01   1.791 0.073259 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(3.6529) family taken to be 1)
## 
##     Null deviance: 2339.63  on 581  degrees of freedom
## Residual deviance:  608.44  on 565  degrees of freedom
## AIC: 9654.4
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  3.653 
##           Std. Err.:  0.206 
## 
##  2 x log-likelihood:  -9618.373
par(mfrow = c(2, 2))
plot(Apps.nb)

Apps.nb$deviance/Apps.nb$df.residual
## [1] 1.076887
pchisq(Apps.nb$deviance, Apps.nb$df.residual, lower.tail = F)
## [1] 0.1002212

The dispersion parameter for this model is quite small (3.6). Looking at the residual plot however we see a strong shape in the first subplot (residuals), and a trend in the third subplot (std. deviance residuals). In an attempt to improve things, we explored using hte same transformations as in the linear model, as follows:

Apps.nb <- glm.nb(Apps ~ Private + log(Top10perc) + log(Top25perc) + 
                    log(F.Undergrad) + log(P.Undergrad) + Outstate + Room.Board + 
                    log(PhD) + log(S.F.Ratio) + log(Expend) + log(Grad.Rate) + 
                    Private:Room.Board + log(Top10perc):log(Top25perc) + log(P.Undergrad):Outstate + 
                    Outstate:log(Expend) + Room.Board:log(Expend) + log(S.F.Ratio):log(Grad.Rate) + 
                    log(Expend):log(Grad.Rate), data = College.train, link = log)
summary(Apps.nb)
## 
## Call:
## glm.nb(formula = Apps ~ Private + log(Top10perc) + log(Top25perc) + 
##     log(F.Undergrad) + log(P.Undergrad) + Outstate + Room.Board + 
##     log(PhD) + log(S.F.Ratio) + log(Expend) + log(Grad.Rate) + 
##     Private:Room.Board + log(Top10perc):log(Top25perc) + log(P.Undergrad):Outstate + 
##     Outstate:log(Expend) + Room.Board:log(Expend) + log(S.F.Ratio):log(Grad.Rate) + 
##     log(Expend):log(Grad.Rate), data = College.train, link = log, 
##     init.theta = 7.93895779)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4634  -0.7555  -0.0975   0.5061   7.3945  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    3.504e+01  8.697e+00   4.028 5.61e-05 ***
## PrivateYes                     6.453e-01  1.702e-01   3.790 0.000150 ***
## log(Top10perc)                -7.300e-01  1.715e-01  -4.258 2.07e-05 ***
## log(Top25perc)                -4.337e-01  1.483e-01  -2.924 0.003452 ** 
## log(F.Undergrad)               9.547e-01  2.719e-02  35.113  < 2e-16 ***
## log(P.Undergrad)              -1.593e-01  3.264e-02  -4.882 1.05e-06 ***
## Outstate                      -5.907e-04  1.488e-04  -3.971 7.15e-05 ***
## Room.Board                     1.769e-03  4.907e-04   3.606 0.000311 ***
## log(PhD)                       1.915e-01  6.568e-02   2.916 0.003542 ** 
## log(S.F.Ratio)                -3.713e+00  1.049e+00  -3.541 0.000399 ***
## log(Expend)                   -2.761e+00  7.448e-01  -3.707 0.000209 ***
## log(Grad.Rate)                -8.864e+00  2.170e+00  -4.084 4.42e-05 ***
## PrivateYes:Room.Board         -1.621e-04  4.053e-05  -3.999 6.35e-05 ***
## log(Top10perc):log(Top25perc)  1.790e-01  4.374e-02   4.093 4.26e-05 ***
## log(P.Undergrad):Outstate      1.051e-05  2.481e-06   4.238 2.26e-05 ***
## Outstate:log(Expend)           6.043e-05  1.568e-05   3.854 0.000116 ***
## Room.Board:log(Expend)        -1.716e-04  5.472e-05  -3.136 0.001713 ** 
## log(S.F.Ratio):log(Grad.Rate)  9.294e-01  2.543e-01   3.655 0.000258 ***
## log(Expend):log(Grad.Rate)     7.280e-01  1.882e-01   3.868 0.000110 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(7.939) family taken to be 1)
## 
##     Null deviance: 5072.09  on 581  degrees of freedom
## Residual deviance:  594.69  on 563  degrees of freedom
## AIC: 9181
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  7.939 
##           Std. Err.:  0.460 
## 
##  2 x log-likelihood:  -9140.991
par(mfrow = c(2, 2))
plot(Apps.nb, labels.id = College.train$college)

Apps.nb$deviance/Apps.nb$df.residual
## [1] 1.056292
p.val.22<-pchisq(Apps.nb$deviance, Apps.nb$df.residual, lower.tail = F)

After adding in the transformations we seen an improvement in the residual and std deviance residual plots. The p-value for overdispersion is now at ‘r p.val.22’, indicating that we cannot reject this model. The remaining cause for concern is Talladega College with a high residual and std deviance residual - this college is an outlier, but since it does not have high leverage and a Cooks’ distance below .5 we are not worried.

  1. Carry out the predictive checks using simulated replicates with RMSE and add RMSE from the test data and observed data to your plot. What do these suggest about 1) model adequacy and 2) model comparison? Which model out of all that you have fit do you recommend?
S = 1000
class(Apps.nb) = "glm"
beta.nb = sim(Apps.nb, n.sims = 1000)
X.nb = model.matrix(Apps.nb)
Y.nb = matrix(NA, nrow(College.train), ncol = S)
beta.nb@sigma = rnorm(S, Apps.nb$theta, Apps.nb$SE.theta)

rmse.train.nb = rmse(predict(Apps.nb), College.train$Apps)
rmse.test.nb = rmse(predict(Apps.nb, newdata = College.test), College.test$Apps)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : calling predict.lm(<fake-lm-object>) ...
rmse.rep.nb = rep(0, S)
rmse.test.nb
## [1] 5787.761
for (s in 1:S) {
  mu = exp(X.nb %*% beta.nb@coef[s, ])
  Y.nb[,s] = rnegbin(n, mu=mu, theta=beta.nb@sigma[s])
  rmse.rep.nb[s] = rmse(predict(Apps.nb), Y.nb[,s])
}

rmse.nb.df = data.frame(rsme.test.nb = rmse.test.nb, rmse.rep.nb = rmse.rep.nb)
rmse.rep.nb = as.data.frame(rmse.rep.nb)

hist.nb <-ggplot(data= rmse.rep.nb, aes(x=rmse.rep.nb)) + geom_histogram(fill = "slateblue") + geom_vline(xintercept = rmse.train.nb, color = "red") + geom_vline(xintercept = rmse.test.nb, col="cornflowerblue") + xlab("RMSE") + ylab("Count")
print(hist.nb)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

pval.nb <-sum(rmse.rep.nb > rmse.test.nb)/S
print(pval.nb)
## [1] 0.04

ANSWER: From the histogram, it is clear that the Negative Binomial helps to explain more of the variation in the data than the Poisson or Quasi-Poisson. Judging from the small RMSE for the training data, we may be overfitting the data. The RMSE of the test set is closer to the mean of the simulations than those of the former, but the p-value is small at ‘r pval’, indicating that we should be careful when using the model for prediction purposes!

  1. While RMSE is a popular summary for model goodness of fit, coverage of confidence intervals is an alternative. For each case in the test set, find a 95% prediction interval. Now evaluate if the responses in the test data are inside or outside of the intervals. If we have the correct coverage, we would expect that at least 95% of the intervals would contain the test cases. Write a function to calculate coverage (the input should be the fitted model object and the test data-frame) and then evaluate coverage for each of the 5 models that you fit (the two normal, the two Poisson and the negative binomial) including plots of the confidence intervals versus ordered by the prediction, with the left out data added as points. Comment on the plots, highlighting any unusual colleges where the model predicts poorly.
#get prediction interval directly for linear regression models
pi.lm =   predict.lm(object = apps.lm0,newdata = 
                       College.test[,-17],se.fit=TRUE,interval="prediction",level=0.95)$fit[,2:3]
pi.best = exp(predict.lm(object = app_best,newdata =  
                           College.test[,-17],se.fit=TRUE,interval="prediction",level=0.95)$fit[,2:3])
#get prediction interval for poisson regression with simulation  
S = 1000
beta = sim(Apps.poi, n.sims = 1000)
X = model.matrix(Apps.poi,data=College.test[,-17])
Y = matrix(NA, nrow(College.test), ncol = S)

for (s in 1:S) {
  lambda = exp(X %*% beta@coef[s,]) 
  Y[, s] = rpois(nrow(College.test),lambda)
}
pi.poi = t(apply(Y, 1, function(x) {quantile(x, c((1 - .95)/2,.5 + .95/2))}))
#prediction interval for quasipoisson
# Simulate 1000 replicates
S = 1000
n<-nrow(College.test)
qp.sim = sim(Apps.qpoi.2, n.sims = 1000)
# Generate the design matrix 
X.q = model.matrix(Apps.qpoi.2, data = College.test[,-17])
# Allocate the output 
Y.q = matrix(NA, nrow(College.test[,-17]), ncol = S)
# ????? should be the square, correct????
d<-(qp.sim@sigma[1])^2
for (s in 1:S) {
  # generate the mu's
  mu<-exp(X.q %*% qp.sim@coef[s,])
  # generate the response realization
  Y.q[, s] = rnegbin(n, mu=mu, theta=mu/(d-1))
}
pi.qpoi.2 = t(apply(Y.q, 1, function(x) {quantile(x, c((1 - .95)/2,.5 + .95/2))}))
#prediction interval for negative binomial
nsim=1000
require(mvtnorm)
n = nrow(College.test)
X = model.matrix(Apps.nb, data=College.test[,-17])
class(Apps.nb)<-"glm"
sim.nb<-sim(Apps.nb, nsim)
sim.nb@sigma<-rnorm(nsim, Apps.nb$theta, Apps.nb$SE.theta)


#beta = rmvnorm(nsim, coef(Apps.nb), vcov(Apps.nb)) 
#theta = rnorm(nsim, Apps.nb$theta, Apps.nb$SE.theta) 
y.rep = matrix(NA, nsim, n)
for (i in 1:nsim) {
  mu = exp(X %*% sim.nb@coef[i,])
  y.rep[i,] = rnegbin(n, mu=mu, theta=sim.nb@sigma[i])
  
  }

pi.nb = t(apply(y.rep, 2, function(x) {quantile(x, c((1 - .95)/2,.5 + .95/2))}))
coverage = function(mod,test.df){
  y = test.df[,"Apps"]
  if(identical(mod,apps.lm0)){
    return(mean(y >= pi.lm[,1] & y <= pi.lm[,2]))
  }
  if(identical(mod,app_best)){
    return(mean(y >= pi.best[,1] & y <= pi.best[,2]))
  }
  if(identical(mod,Apps.poi)){
    return(mean(y >= pi.poi[,1] & y <= pi.poi[,2]))
  }
  if(identical(mod,Apps.qpoi.2)){
    return(mean(y >= pi.qpoi.2[,1] & y <= pi.qpoi.2[,2]))
  }
  if(identical(mod,Apps.nb)){
    return(mean(y >= pi.nb[,1] & y <= pi.nb[,2]))
  }
}
# coverage for each of the 5 models that you fit
a = coverage(apps.lm0,College.test[,-17])
b =coverage(app_best,College.test[,-17])
c =coverage(Apps.poi,College.test[,-17])
d =coverage(Apps.qpoi.2,College.test[,-17])
e =coverage(Apps.nb,College.test[,-17])
a;b;c;d;e
## [1] 0.9589744
## [1] 0.9487179
## [1] 0.08205128
## [1] 0.9641026
## [1] 0.3333333
par(mfrow=c(2,2))
#plots of the confidence intervals versus ordered by the prediction, with the left out data added as points
df = data.frame(apps = College.test$Apps, 
                pred = predict(apps.lm0,College.test[,-17],type="response"),  
                lwr = pi.lm[,1], upr=pi.lm[,2]) 
df = df %>% arrange(pred)   # sort by prediction

gp = ggplot(df, aes(x=pred, y=apps)) + 
  geom_ribbon(aes(ymin = lwr, ymax = upr), 
              fill = "blue", alpha = 0.2) + 
  geom_point(aes(y=apps)) +
  xlab("predicted value") +
  ylab("Apps") +
  ggtitle("95% Prediction Intervals under linear regression Model(apps.lm0)")
gp

# plot for app_best
df = data.frame(apps = College.test$Apps, 
                pred = exp(predict(app_best,College.test[,-17])),  
                lwr = pi.best[,1], upr=pi.best[,2]) 
df = df %>% arrange(pred)   # sort by prediction

gp = ggplot(df, aes(x=pred, y=apps)) + 
  geom_ribbon(aes(ymin = lwr, ymax = upr), 
              fill = "blue", alpha = 0.2) + 
  geom_point(aes(y=apps)) +
  xlab("predicted value") +
  ylab("Apps") +
  ggtitle("95% Prediction Intervals under linear regression Model(app_best)")
gp

#plot for Apps.poi
df = data.frame(apps = College.test$Apps, 
                pred = predict(Apps.poi,College.test[,-17],type="response"),  
                lwr = pi.poi[,1], upr=pi.poi[,2]) 
df = df %>% arrange(pred)   # sort by prediction

gp = ggplot(df, aes(x=pred, y=apps)) + 
  geom_ribbon(aes(ymin = lwr, ymax = upr), 
              fill = "blue", alpha = 0.2) + 
  geom_point(aes(y=apps)) +
  xlab("predicted value") +
  ylab("Apps") +
  ggtitle("95% Prediction Intervals under Poisson regression Model(Apps.poi)")
gp

#plot for Apps.qpoi.2
df = data.frame(apps = College.test$Apps, 
                pred = predict(Apps.qpoi.2,College.test[,-17],type="response"),  
                lwr = pi.qpoi.2[,1], upr=pi.qpoi.2[,2]) 
df = df %>% arrange(pred)   # sort by prediction

gp = ggplot(df, aes(x=pred, y=apps)) + 
  geom_ribbon(aes(ymin = lwr, ymax = upr), 
              fill = "blue", alpha = 0.2) + 
  geom_point(aes(y=apps)) +
  xlab("predicted value") +
  ylab("Apps") +
  ggtitle("95% Prediction Intervals under QuasiPoisson regression Model(Apps.qpoi.2)")
gp

#plot for Apps.nb
df = data.frame(apps = College.test$Apps, 
                pred = predict.glm(Apps.nb,College.test[,-17],type="response"),  
                lwr = pi.nb[,1], upr=pi.nb[,2]) 
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : calling predict.lm(<fake-lm-object>) ...
df = df %>% arrange(pred)   # sort by prediction

gp = ggplot(df, aes(x=pred, y=apps)) + 
  geom_ribbon(aes(ymin = lwr, ymax = upr), 
              fill = "blue", alpha = 0.2) + 
  geom_point(aes(y=apps)) +
  xlab("predicted value") +
  ylab("Apps") +
  ggtitle("95% Prediction Intervals under Negative Binomial Model(Apps.nb)")
gp

College.test %>%
          filter(Apps>40000)%>%
          select(college)
##                    college
## 1 Rutgers at New Brunswick

From the plots, it appears that only the first two, which are both linear regression models, indicate good coverage. In addition, the quasipoisson model also provides a very good coverage. The other two glm models seem to provide bad coverage for prediction, probably due to overfitting. From the plot, it can be observed that one point(Rutgers at New Brunswick) has unusually high y value, and the models seems to predict poorly on that outlier point.

  1. Provide a table with the 1) RMSE’s on the observed data, 2) RMSE’s on the test data, 3) coverage, 4) the predictive check p-value with one row for each of the 5 models and comment the results. Which model do you think is best and why?
RMSE_obs = c(RMSE.1,RMSE.1.bestlm,rmse.train.poi,rmse.qpoi.train,rmse.train.nb)
RMSE_test = c(RMSE.0.test, RMSE.best.test,rmse.test, rmse.qpoi.test,rmse.test.nb)
coverage = c(a,b,c,d,e)
pvalue = c(pval.lm,pval.best,pval.poi,pval.train,pval.nb)
row =c("apps.lm0","app_best","Apps.poi","Apps.qpoi.2","Apps.nb")
out = data.frame(row.names = row,RMSE_obs,RMSE_test,coverage,pvalue)
kable(out)
RMSE_obs RMSE_test coverage pvalue
apps.lm0 1441.111 2894.803 0.9589744 0.315
app_best 1244.907 2781.898 0.9487179 0.386
Apps.poi 4550.667 5787.812 0.0820513 0.000
Apps.qpoi.2 4550.624 5787.781 0.9641026 0.668
Apps.nb 4550.613 5787.761 0.3333333 0.040

I think app_best, which is the linear regression model with transformations and interactions added. From the table, app_best’s RMSE on the observed data and on the test data are relatively small, compared to the other models. It also has a great coverage of 0.94. Predictive check p-value is also bigger than 0.05, and it appears to be a good fit.

  1. For your “best” model provide a nicely formatted table (use kable() or xtable()) of relative risks and 95% confidence intervals. Pick 5 of the most important variables and provide a paragraph that provides an interpretation of the parameters (and intervals) that can be provided to a university admissions officer about which variables increase admissions.
summary(app_best)
## 
## Call:
## lm(formula = log(Apps) ~ log(Top10perc) + Top25perc + log(F.Undergrad) + 
##     log(P.Undergrad) + Outstate + Room.Board + log(PhD) + Elite, 
##     data = College.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.52512 -0.23310  0.02069  0.22267  2.09038 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.086e+00  2.559e-01  -4.244 2.56e-05 ***
## log(Top10perc)   -1.732e-01  5.089e-02  -3.404  0.00071 ***
## Top25perc         5.779e-03  1.881e-03   3.073  0.00222 ** 
## log(F.Undergrad)  9.778e-01  2.288e-02  42.736  < 2e-16 ***
## log(P.Undergrad) -6.152e-02  1.422e-02  -4.327 1.79e-05 ***
## Outstate          3.134e-05  6.572e-06   4.769 2.35e-06 ***
## Room.Board        9.707e-05  2.051e-05   4.734 2.78e-06 ***
## log(PhD)          1.986e-01  6.711e-02   2.959  0.00321 ** 
## EliteYes          1.828e-01  6.830e-02   2.677  0.00765 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3802 on 573 degrees of freedom
## Multiple R-squared:  0.8734, Adjusted R-squared:  0.8716 
## F-statistic: 493.9 on 8 and 573 DF,  p-value: < 2.2e-16
#since our best model is linear regression model, relative risk doesn't make much sense here, so we just
#provide the coefficient estimates and the confidence intervals.
out = as.data.frame(cbind(app_best$coefficients,confint(app_best)))
kable(out,col.names = c("coef estimates","2.5% CI","97.5% CI"))
coef estimates 2.5% CI 97.5% CI
(Intercept) -1.0858859 -1.5884221 -0.5833496
log(Top10perc) -0.1732362 -0.2731933 -0.0732791
Top25perc 0.0057795 0.0020853 0.0094736
log(F.Undergrad) 0.9777667 0.9328289 1.0227045
log(P.Undergrad) -0.0615160 -0.0894422 -0.0335898
Outstate 0.0000313 0.0000184 0.0000443
Room.Board 0.0000971 0.0000568 0.0001374
log(PhD) 0.1985791 0.0667642 0.3303940
EliteYes 0.1828149 0.0486669 0.3169628
#pick 5 of the most important variables and their confidence intervals
cbind(app_best$coefficients[5:9],confint(app_best,c("log(F.Undergrad)","log(P.Undergrad)","Outstate","Room.Board","log(PhD)")))
##                                        2.5 %        97.5 %
## log(P.Undergrad) -6.151598e-02  9.328289e-01  1.022705e+00
## Outstate          3.134350e-05 -8.944216e-02 -3.358980e-02
## Room.Board        9.707301e-05  1.843528e-05  4.425172e-05
## log(PhD)          1.985791e-01  5.679444e-05  1.373516e-04
## EliteYes          1.828149e-01  6.676415e-02  3.303940e-01

ANSWER:
The five most important variables coefficient estimates are listed in the table above, along with their confidence intervals. The number of full time undergraduates, percentage of facaulty with Phd’s seem to have a positive effect on application, while the number of part time undergraduates and out of state tuition seem to have a negative effect on the number of applications.From the model, Room and board costs seems to have a positive relationship, which appears to contradict with our common sense. However, considering the fact that private and prestigious schools usually have high cost of room and board, it makes sense that higher room and board costs have a positive effect on the number of applications.

Some Theory

  1. We have said that the residual deviance has a \(\chi^2\) distribution with \(n - p - 1\) degrees of freedom (n - the number of parameters) if the model is correct. What is the mean and standard deviation of the residual deviance? Use the Empirical Rule to suggest an easy to use cut-off for suggesting overdispersion that you can calculate without a calculator.

ANSWER: The residual deviance has mean \(n-p-1\) and standard deviation \(\sqrt{2(n-p-1)}\) (basic results from distribution theory). By definition of a \(\Chi^2\), the residual deviance is equal in distribution to \(\sum_{i=1}^{n-p-1} Z_i^2\) where \(Z_i\) are i.i.d. standard normals. By virtue of the CLT, we have for large residual degrees of freedom \(df=n-p-1\), \[ \sqrt{n-p-1} \left( \frac{\sum_{i=1}^{n-p-1} Z_i^2}{n-p-1}-1\right) \approx \mathcal{N}(0, 2).\] Therefore, we have that the residual deviance \(D\) divided by \(df\) satisfies \[\frac{D}{df}-1 \approx \mathcal{N}(0, \frac{2}{df})\] Using the Empirical Rule (95% confidence interval approximated by \(\pm 2\sigma\) for normal), we derive a heuristic cutoff of \[ \frac{D}{f}> 1+ 2 \sqrt{\frac{2}{df}}=1+ \sqrt{\frac{8}{df}}.\]

  1. Gamma mixtures of Poissons: From class we said that \[\begin{align} Y \mid \lambda & \sim P(\lambda) \\ p(y \mid \lambda) & = \frac{\lambda^y e^{-\lambda}} {y!} \\ & \\ \lambda \mid \mu, \theta & \sim G(\theta, \theta/\mu) \\ p(\lambda \mid \mu, \theta) & = \frac{(\theta/ \mu)^\theta}{\Gamma(\theta)} \lambda^{\theta - 1} e^{- \lambda \theta/\mu} \\ & \\ p(Y \mid \mu, \theta) & = \int p(Y \mid \lambda) p(\lambda \mid \theta, \theta/\mu) d \lambda \\ & = \frac{ \Gamma(y + \theta)}{y! \Gamma(\theta)} \left(\frac{\theta}{\theta + \mu}\right)^{\theta} \left(\frac{\mu}{\theta + \mu}\right)^{y} \\ Y \mid \mu, \theta & \sim NB(\mu, \theta) \end{align}\] Derive the density of \(Y \mid \mu, \theta\) in (8) showing your work using LaTeX expressions. (Note this may not display if the output format is html, so please use pdf.) Using iterated expectations with the Gamma-Poisson mixture, find the mean and variance of \(Y\), showing your work. \[\begin{align} p(Y \mid \mu, \theta) & = \int p(Y \mid \lambda) p(\lambda \mid \theta, \theta/\mu) d \lambda \\ & = \int \frac{\lambda^y e^{-\lambda}} {y!}\frac{(\theta/ \mu)^\theta}{\Gamma(\theta)} \lambda^{\theta - 1} e^{- \lambda \theta/\mu} d \lambda \\ & = \frac{(\theta/ \mu)^\theta}{\Gamma(\theta)y!} \int\lambda^{y+\theta - 1} e^{- \lambda (\theta/\mu+1)} d \lambda \\ & = \frac{(\theta/ \mu)^\theta}{\Gamma(\theta)y!} \frac{\Gamma(y+\theta)}{(1+\theta / \mu)^{\theta+y}} \\ & = \frac{ \Gamma(y + \theta)}{y! \Gamma(\theta)} \left(\frac{\theta}{\theta + \mu}\right)^{\theta} \left(\frac{\mu}{\theta + \mu}\right)^{y} \\ E(Y) &= E[E(Y \mid \lambda)] \\ &= E[\lambda] \\ &= \frac{\theta}{\theta / \mu} \\ &= \mu \\ Var(Y) &= E[Var(Y \mid \lambda)] + Var[E(Y \mid \lambda)] \\ &= E(\lambda) + Var(\lambda)\\ & = \mu + \frac{\theta}{\theta^2 / \mu^2}\\ & = \mu + \frac{\mu^2}{\theta} \end{align}\]